[feat] Support Qwen3_5 Training #143
Conversation
This part looks unnecessary. You can directly use the vision iterable.
Same for this one. There seems to be no need to override the load-from-JSON method?
kcz358
left a comment
Checked the hf and transformers repos; it seems Qwen3.5 uses the exact same logic as Qwen3. So I think all the data processing classes can use the Qwen3 processor and dataset.
One thing to notice here: Qwen3.5 uses hybrid attention (linear + full). Can we just use the FLOPs function for Qwen2?
kcz358
left a comment
LGTM. I think the estimated FLOPs are a bit inaccurate. If we're not sure what the FLOPs function for gated delta net should be, maybe we can leave it empty, or wait and see if we can copy it from verl etc. :)
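For context on the discussion above, a minimal sketch of the dense-transformer FLOPs approximation the reviewers are referring to (the kind of estimate a Qwen2-style FLOPs function uses). The function name and signature are hypothetical; the point is that the quadratic attention term only applies to full-attention layers, which is why the estimate is off for a hybrid model and why leaving the gated-delta-net term empty is reasonable:

```python
def estimate_dense_flops(num_params: int, seq_len: int,
                         num_layers: int, hidden_size: int) -> float:
    """Standard dense-transformer training-FLOPs approximation.

    6 * N * T covers the weight matmuls (forward + backward);
    12 * L * H * T^2 covers full self-attention. Linear-attention
    (gated delta net) layers would need a different, linear-in-T
    term, so this over-counts on a hybrid model like Qwen3.5.
    """
    weight_flops = 6.0 * num_params * seq_len
    attn_flops = 12.0 * num_layers * hidden_size * seq_len ** 2
    return weight_flops + attn_flops
```

With toy numbers, `estimate_dense_flops(100, 10, 2, 4)` gives `6000 + 9600 = 15600`, showing how quickly the quadratic term dominates as `seq_len` grows.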
* feat(models): add transformers 5.0 compatibility
  Conditionally import models incompatible with transformers >= 5.0:
  - dream_dllm, qwen3_dllm, llada_dllm require transformers < 5.0
  - llava_onevision1_5 requires transformers < 5.0
  - Dynamically update __all__ based on transformers version
  - Prevents ImportError when using transformers 5.0+
* fix(train): add group_by_length for backward compatibility
  Add group_by_length parameter to TrainingArguments to maintain compatibility with existing training configurations.
* feat(deps): allow transformers >= 4.57.1
  Update the transformers dependency from an exact version to a minimum version to support transformers 5.0+ while maintaining backward compatibility.
* style: auto-fix lint (black + isort)
* refactor(processor): replace additional_special_tokens with all_special_tokens
  Use all_special_tokens for transformers >= 5.0 compatibility while maintaining backward compatibility with transformers < 5.0.
  Changes:
  - Add special_tokens property to all processor classes
  - Use all_special_tokens if available (transformers >= 5.0)
  - Fall back to additional_special_tokens (transformers < 5.0)
  - Add <|im_start|> and <|im_end|> tokens to the special_tokens list
  - Cache special_tokens as an instance attribute for performance
  Affected processors:
  - AeroDataProcessor (base class)
  - BaseQwen2_5_DataProcessor (inherits from AeroDataProcessor)
  - Qwen2VLDataProcessor
  - Qwen2DataProcessor
  - LLaVADataProcessor
  - LLaVAVideoDataProcessor (inherits from LLaVADataProcessor)
  - NanovlmDataProcessor
  - Qwen3_VLDataProcessor (inherits from BaseQwen2_5_DataProcessor)
* style: auto-fix lint (black + isort)
* refactor(processor): unify apply_chat_template usage
  Use processor.apply_chat_template with tokenize=True consistently across all processors instead of mixing it with processor.tokenizer calls.
  Changes:
  - aero_processor: use processor.apply_chat_template(tokenize=True)[0]
  - base_qwen2_5_processor: use processor.apply_chat_template(tokenize=True)[0]
  - qwen2_vl_processor: use processor.apply_chat_template(tokenize=True)
  - qwen3_vl_processor: use processor.apply_chat_template(tokenize=True)[0]
  This ensures all processors return token IDs directly during data preparation, improving consistency and reducing confusion.
* feat(models): add common_ops for transformers-agnostic rope index
  Extract rope index calculation functions into common_ops/rope.py to ensure consistent behavior across transformers versions.
  Changes:
  - Add common_ops/rope.py with qwen2_5_vl_rope_index and qwen3_vl_get_rope_index
  - Update qwen2_5_vl_ops.py to use qwen2_5_vl_rope_index
  - Update qwen3_vl_ops.py to use qwen3_vl_get_rope_index
  - Update qwen3_vl_moe_ops.py to use qwen3_vl_get_rope_index
  This ensures rope index calculations remain stable even when transformers internal implementations change.
* fix(utils): add B200/B300 GPU FLOPS support
  Add NVIDIA B200/B300 GPU FLOPS (2.25e15) to get_device_flops() to fix the MFU calculation returning 0 on B200 GPUs. Previously, unknown GPU types returned inf FLOPS, causing MFU to always be 0.
* Lint
* fix(models): qwen2_5_vl transformers 5.0 compatibility
  - Fix vision_model variable reference in liger kernel patch
  - Support nested text_config in lce_forward
  - Handle rope_scaling/rope_parameters for transformers 5.0+
  - Add qwen2_5_vl to FlopsCounter model type mapping
* refactor(processor): use DataUtilities.apply_chat_template for transformers 5.0 compatibility
  - Add apply_chat_template utility method to DataUtilities
  - Handles dict-like return values (BatchEncoding) with use_key param
  - Handles nested list wrapping from some processors
  - Update all processors to use the unified method
* feat(launch): add filter_training_args for transformers 5.0 compatibility
  Filter unsupported TrainingArguments parameters by inspecting the transformers.TrainingArguments.__init__ signature, avoiding errors from deprecated or removed parameters in newer versions.
* fix(models): add parse_visual_output for transformers 5.0 compatibility
  Visual model methods (get_image_features, get_video_features, visual()) may return tuples OR dataclass objects (BaseModelOutputWithPooling, BaseModelOutputWithDeepstackFeatures) in transformers 5.0+. Add parse_visual_output() to transparently handle both return types.
* [feat] Support Qwen3_5 Training (#143)
  * [feat] Support Qwen3_5 Training
  * style: auto-fix lint (black + isort)
  * [feat] Support Qwen3.5 Training
  * optimize qwen3.5 dataset process logic
  * optimize qwen3.5 dataset process logic
  * flop function leave empty
  ---------
  Co-authored-by: charlesswu <charlesswu@tencent.com>
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* fix(processor): remove duplicate special_tokens property in qwen2_vl_processor
* fix(models): remove duplicate .to() calls in qwen2_5_omni_liger
* fix(models): define input_ids_rmpad in inputs_embeds branch to avoid NameError
* refactor(models): extract parse_visual_output to common_ops/visual.py
* refactor(processor): extract special_tokens logic to DataUtilities.get_special_tokens
* style: auto-fix lint (black + isort)
* docs: add Transformers 5.0 migration guide
  Add a comprehensive migration guide for transformers 5.0 compatibility. Includes a compatibility matrix, installation instructions, and troubleshooting for Qwen3.5 (requires >= 5.3.0) and legacy models (requires < 5.0.0).
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: wukeming <108406625+KemingWu@users.noreply.github.com>
Co-authored-by: charlesswu <charlesswu@tencent.com>
Co-authored-by: mwxely <yang0756@e.ntu.edu.sg>
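The `parse_visual_output` normalization extracted to `common_ops/visual.py` can be sketched as a small dispatcher over the three return shapes the commits mention (tuple, ModelOutput-style object, bare tensor). This is an illustrative sketch of the idea, not the repo's exact function:

```python
from types import SimpleNamespace


def parse_visual_output(out):
    """Normalize visual-tower returns (sketch).

    transformers 5.0+ may return a tuple or a ModelOutput-style object
    (e.g. BaseModelOutputWithPooling); older versions return the feature
    tensor directly. Always hand back the feature tensor.
    """
    if isinstance(out, tuple):
        return out[0]
    if hasattr(out, "last_hidden_state"):
        return out.last_hidden_state
    return out
```

Call sites like `get_image_features` can then wrap their result once instead of branching on the transformers version everywhere.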
Motivation
Modifications
Commit Message Convention
Please follow our standardized commit message format:
[feat] - New features or functionality
[fix] - Bug fixes
[docs] - Documentation changes only
[style] - Code style changes (formatting, missing semicolons, etc.)
[refactor] - Code refactoring without changing functionality
[perf] - Performance improvements
[test] - Adding or updating tests
[chore] - Maintenance tasks, dependency updates, etc.
[ci] - CI/CD configuration changes
Examples:
[feat] add qwen omni iterable dataset support
[fix] resolve bagel model configuration error
[docs] update training guide with YAML examples
See CONTRIBUTING.md for more details.
CI/CD Checks
Your PR will automatically run the following checks:
- black (line-length=120) and import sorting with isort
- Run pre-commit run --all-files locally to verify before pushing
Checklist
- Run pre-commit run --all-files and ensure all checks pass
- Format code with black (line-length=120) and isort